A storyboard is a roadmap for video creation consisting of shot-by-shot images that visualize the key plots of a text synopsis. Creating video storyboards, however, remains challenging: it not only requires associating high-level text with images, but also demands long-term reasoning to make transitions across shots smooth. In this paper, we propose a new task, Text synopsis to Video Storyboard (TeViS), which aims to retrieve an ordered sequence of images to visualize a text synopsis. We construct a MovieNet-TeViS benchmark based on the public MovieNet dataset. It contains 10K text synopses, each paired with keyframes manually selected from the corresponding movies by considering both relevance and cinematic coherence. We also present an encoder-decoder baseline for the task. The model uses a pretrained vision-and-language model to improve high-level text-image matching. To improve coherence over long shot sequences, we further propose to pre-train the decoder on large-scale movie frames without text. Experimental results demonstrate that our proposed model significantly outperforms other models in creating text-relevant and coherent storyboards. Nevertheless, there is still a large gap compared to human performance, suggesting room for promising future work.
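To make the retrieval-then-ordering setup concrete, here is a minimal sketch of an encoder-decoder of the kind the abstract describes, assuming precomputed features from a pretrained vision-and-language model; the class name, dimensions, and greedy decoding loop are illustrative stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn

class StoryboardRetriever(nn.Module):
    """Text-conditioned decoder that scores candidate frames shot by shot.

    The synopsis embedding acts as the decoder memory; the decoder consumes
    the frames selected so far and scores every candidate for the next shot.
    """

    def __init__(self, dim=512, heads=8, layers=4):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, layers)
        self.start = nn.Parameter(torch.randn(1, 1, dim))  # learned BOS frame

    def forward(self, text_feats, selected, candidates):
        # text_feats: (B, L, D) synopsis token features (memory)
        # selected:   (B, S, D) features of frames chosen so far
        # candidates: (N, D)    features of the candidate frame pool
        B = text_feats.size(0)
        tgt = torch.cat([self.start.expand(B, -1, -1), selected], dim=1)
        T = tgt.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, text_feats, tgt_mask=causal)
        return hidden[:, -1] @ candidates.T  # (B, N) next-shot scores

# Greedy retrieval of a 5-shot storyboard from a toy pool of 100 frames.
model = StoryboardRetriever()
text = torch.randn(1, 16, 512)   # stand-in for pretrained text features
pool = torch.randn(100, 512)     # stand-in for pretrained frame features
chosen = torch.empty(1, 0, 512)
for _ in range(5):
    idx = model(text, chosen, pool).argmax(dim=-1)
    chosen = torch.cat([chosen, pool[idx].unsqueeze(1)], dim=1)
```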
In this paper, we present ExtremeBERT, a toolkit for accelerating and customizing BERT pretraining. Our goal is to provide an easy-to-use BERT pretraining toolkit for the research community and industry, making the pretraining of popular language models on customized datasets affordable with limited resources. Experiments show that, to achieve the same or better GLUE scores, the time cost of our toolkit is over $6\times$ lower for BERT Base and $9\times$ lower for BERT Large compared with the original BERT paper. The documentation and code are released at https://github.com/extreme-bert/extreme-bert under the Apache-2.0 license.
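For readers unfamiliar with what such a toolkit automates, the following is a generic masked-language-modeling pretraining step in plain PyTorch with HuggingFace transformers; it is not ExtremeBERT's API, and the masking is simplified (no 80/10/10 replacement split, special tokens not excluded).

```python
import torch
from transformers import BertConfig, BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM(BertConfig())          # randomly initialized BERT Base
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

batch = tokenizer(["a customized pretraining corpus sentence"],
                  return_tensors="pt", padding=True)
labels = batch["input_ids"].clone()
# Mask 15% of tokens (simplified masking scheme).
mask = torch.rand(labels.shape) < 0.15
batch["input_ids"][mask] = tokenizer.mask_token_id
labels[~mask] = -100                           # ignore unmasked positions in loss

loss = model(**batch, labels=labels).loss      # one pretraining step
loss.backward()
optimizer.step()
```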
In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions. Yet robotic manipulation is extremely challenging, as it requires fine-grained motor control, long-term memory, and generalization to previously unseen tasks and environments. To address these challenges, we propose a unified transformer-based approach that takes multiple inputs into account. In particular, our transformer architecture integrates (i) natural language instructions and (ii) multi-view scene observations, while (iii) keeping track of the full history of observations and actions. This approach enables learning dependencies between the history and the instructions, and improves manipulation precision by using multiple views. We evaluate our method on the challenging RLBench benchmark and on a real-world robot. Notably, our approach scales to 74 diverse RLBench tasks and outperforms the state of the art. We also address instruction-conditioned tasks and demonstrate excellent generalization to previously unseen variations.
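A rough sketch of how the three input streams could be fused in a single transformer is shown below; the module name, feature dimensions, and embedding scheme are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiInputPolicy(nn.Module):
    """Illustrative fusion of (i) instruction tokens, (ii) multi-view
    observations, and (iii) observation-action history in one encoder."""

    def __init__(self, dim=256, views=3, heads=8, layers=4, action_dim=8):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)
        self.view_emb = nn.Parameter(torch.randn(views, dim))   # camera identity
        self.step_emb = nn.Parameter(torch.randn(64, dim))      # history position
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, instr, views, history):
        # instr:   (B, L, D)     language instruction tokens
        # views:   (B, T, V, D)  per-step, per-camera visual features
        # history: (B, T, D)     embeddings of past actions
        B, T, V, D = views.shape
        views = views + self.view_emb                       # tag each camera
        views = views + self.step_emb[:T].unsqueeze(1)      # tag each timestep
        tokens = torch.cat([instr,
                            views.reshape(B, T * V, D),
                            history + self.step_emb[:T]], dim=1)
        out = self.encoder(tokens)
        return self.action_head(out[:, -1])                 # next-action logits

policy = MultiInputPolicy()
action = policy(torch.randn(2, 12, 256),        # instruction tokens
                torch.randn(2, 5, 3, 256),      # 5 steps x 3 cameras
                torch.randn(2, 5, 256))         # 5 past actions
```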
In vision-and-language navigation (VLN), an embodied agent is required to follow natural language instructions in realistic 3D environments. A major bottleneck of existing VLN approaches is the lack of sufficient training data, resulting in unsatisfactory generalization to unseen environments. While VLN data is typically collected manually, such an approach is expensive and prevents scalability. In this work, we address the data scarcity issue by proposing to automatically create a large-scale VLN dataset from 900 unlabeled 3D buildings from HM3D. We generate a navigation graph for each building and transfer object predictions from 2D to pseudo 3D object labels via cross-view consistency. We then fine-tune a pretrained language model using the pseudo object labels as prompts to alleviate the cross-modal gap in instruction generation. Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions. We experimentally show that HM3D-AutoVLN significantly improves the generalization ability of the resulting VLN models. On the SPL metric, our approach improves over the state of the art by 7.1% and 8.1% on the unseen validation splits of the REVERIE and SOON datasets, respectively.
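The prompt-based instruction generation step might look roughly like the following, here with a stock (not fine-tuned) GPT-2 and an invented prompt format, so both are purely indicative of the idea rather than the paper's pipeline.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Pseudo 3D object labels along a sampled navigation-graph path serve as
# the prompt; a fine-tuned LM would complete it into an instruction.
prompt = "Objects: sofa, lamp, coffee table. Instruction: go to the"
ids = tokenizer(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=True, top_p=0.9,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```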
Sparse training is a natural idea for accelerating the training of deep neural networks and saving memory, especially since large modern neural networks are significantly over-parameterized. However, most existing methods cannot achieve this goal in practice, because the chain-rule-based gradient estimators (w.r.t. the structure parameters) adopted by previous methods require dense computation at least in the backward propagation step. This paper solves this problem by proposing an efficient sparse training method with completely sparse forward and backward passes. We first formulate the training process as a continuous minimization problem under a global sparsity constraint. We then separate the optimization into two steps, corresponding to the weight update and the structure parameter update. For the former step, we use the conventional chain rule, which can be made sparse by exploiting the sparse structure. For the latter step, instead of using a chain-rule-based gradient estimator as in existing methods, we propose a variance-reduced policy gradient estimator that requires only two forward passes and no backward propagation, thus achieving completely sparse training. We prove that the variance of our gradient estimator is bounded. Extensive experimental results on real-world datasets demonstrate that, compared with previous methods, our algorithm is much more effective at accelerating the training process, achieving substantially higher speedups.
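The structure-parameter update can be illustrated with a leave-one-out REINFORCE estimator that uses exactly two forward passes and no backward pass; the closed-form Bernoulli score function below is standard, but the toy loss, gate parameterization, and step size are assumptions, not the paper's exact estimator.

```python
import torch

def bernoulli_logprob_grad(mask, logits):
    """d/d(logits) of log p(mask | logits) for independent Bernoulli gates
    with probabilities sigmoid(logits); closed form, no backprop needed."""
    return mask - torch.sigmoid(logits)

def policy_gradient_step(loss_fn, logits, lr=0.1):
    """Variance-reduced policy gradient step: each sample's baseline is
    the other sample's loss, giving an unbiased leave-one-out estimator
    from exactly two forward passes and zero backward passes."""
    probs = torch.sigmoid(logits)
    m1 = torch.bernoulli(probs)
    m2 = torch.bernoulli(probs)
    f1, f2 = loss_fn(m1), loss_fn(m2)          # the only two forward passes
    g = 0.5 * ((f1 - f2) * bernoulli_logprob_grad(m1, logits)
               + (f2 - f1) * bernoulli_logprob_grad(m2, logits))
    logits -= lr * g                           # descend on structure params
    return logits

# Toy objective: prefer masks close to a fixed target sparsity pattern.
target = torch.tensor([1., 0., 1., 1., 0.])
loss = lambda m: ((m - target) ** 2).sum()
logits = torch.zeros(5)
for _ in range(200):
    logits = policy_gradient_step(loss, logits)
print(torch.sigmoid(logits).round())           # should approach the target
```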
Numerical interpolation for scattered data aims to estimate values at target points based on a set of observed points. Traditional approaches produce estimations by constructing an interpolation function that combines multiple basis functions. These approaches require the basis functions to be pre-defined explicitly, which greatly limits their application in practical scenarios. Recent advances exhibit an alternative strategy that learns the interpolation function directly from the observed points using machine learning techniques, such as deep neural networks. This strategy, although promising, cannot effectively exploit the correlations between observed points and target points, as it treats these two types of points separately. Here we present a learning-based approach to numerical interpolation using Encoder Representations from Transformers (hence called NIERT). NIERT treats the value of each target point as a masked token, which enables it to process target points and observed points in a unified fashion. By computing partial self-attention between target points and observed points, NIERT gains the advantage of exploiting the correlations among these points and, more importantly, avoids unexpected interference of target points on the observed points. NIERT also uses a pre-training technique to further improve its accuracy. On three representative datasets, including two synthetic datasets and one real-world dataset, NIERT outperforms existing approaches; e.g., on the TFRD-ADlet dataset for temperature field reconstruction, NIERT achieves an MAE of $1.897\times 10^{-3}$, substantially better than the Transformer-based approach (MAE: $27.074\times 10^{-3}$). These results clearly demonstrate the accuracy of NIERT and its potential to be applied in multiple practical fields.
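The partial self-attention can be expressed as an attention mask in which every token may attend to observed points (and to itself) but never to other target points; the sketch below, with illustrative dimensions, shows one way to realize this, not NIERT's exact implementation.

```python
import torch
import torch.nn as nn

def partial_attention_mask(n_obs, n_tgt):
    """Boolean mask (True = blocked): every query may attend to observed
    points and to itself, but no token may attend to another target point,
    so target points never interfere with the observed points."""
    n = n_obs + n_tgt
    blocked = torch.ones(n, n, dtype=torch.bool)
    blocked[:, :n_obs] = False                    # observed keys: allowed
    blocked.fill_diagonal_(False)                 # each token sees itself
    return blocked

# Observed points carry (x, y) pairs; target points carry (x, [MASK]).
n_obs, n_tgt, dim = 8, 4, 32
obs = torch.randn(1, n_obs, dim)
tgt = torch.randn(1, n_tgt, dim)                  # value slot = mask token
tokens = torch.cat([obs, tgt], dim=1)

attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
mask = partial_attention_mask(n_obs, n_tgt)
out, _ = attn(tokens, tokens, tokens, attn_mask=mask)
# out[:, n_obs:] are the target-point representations used to predict values.
```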
With the success of vision-language pre-training, we have witnessed the state of the art being pushed forward in multi-modal understanding and generation. However, the current pre-training paradigm either cannot target all modalities at once (e.g., text generation and image generation), or requires multiple well-designed tasks, which significantly limits scalability. We demonstrate that a unified modal model can be learned with a prefix language modeling objective over text and image sequences. Thanks to this simple but powerful pre-training paradigm, our proposed model, DaVinci, is simple to train, scalable to huge data, and adaptable to a variety of downstream tasks across modalities (language/vision/vision+language), types (understanding/generation), and settings (e.g., zero-shot, fine-tuning, linear evaluation) with a single unified architecture. DaVinci achieves competitive performance on a wide range of 26 understanding/generation tasks, and outperforms previous unified vision-language models on most of them, including ImageNet classification (+1.6%), VQAv2 (+1.4%), COCO caption generation (BLEU@4 +1.1%, CIDEr +1.5%), and COCO image generation (IS +0.9%, FID -1.0%), at comparable model and data scale. Furthermore, we offer a well-defined benchmark for future research by reporting performance at different scales of pre-training data with heterogeneous and broad distribution coverage. Our results establish new, stronger baselines for future comparisons at different data scales, and shed light on the difficulties of comparing VLP models more generally.
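The prefix language modeling objective amounts to an attention mask that is bidirectional over the prefix and causal over the suffix, with the loss computed only on suffix tokens; a minimal sketch of such a mask follows (generic prefix-LM, not DaVinci's actual code).

```python
import torch

def prefix_lm_mask(prefix_len, total_len):
    """Boolean mask (True = blocked) for prefix language modeling:
    full bidirectional attention within the prefix, causal attention
    over the suffix, and no attention from the prefix into the suffix."""
    blocked = torch.triu(torch.ones(total_len, total_len, dtype=torch.bool), 1)
    blocked[:, :prefix_len] = False      # everyone sees the whole prefix
    return blocked

# e.g. an 8-token sequence whose first 5 tokens (text and/or image tokens)
# form the prefix; the LM loss is computed only on the 3 suffix tokens.
print(prefix_lm_mask(5, 8).int())
```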
Keyphrase generation aims to produce a set of phrases summarizing the essentials of a given document. Conventional methods normally apply an encoder-decoder architecture to generate the output keyphrases for an input document; because they are designed to focus only on the current document, they inevitably omit crucial corpus-level information carried by other similar documents, i.e., cross-document dependencies and latent topics. In this paper, we propose CDKGen, a Transformer-based keyphrase generator that extends the Transformer's global attention with cross-document attention networks to incorporate available documents as references, so as to generate better keyphrases with the guidance of topic information. On top of the proposed Transformer + cross-document attention architecture, we also adopt a copy mechanism that enhances our model by selecting appropriate words from the documents to handle out-of-vocabulary words in keyphrases. Experimental results on five benchmark datasets illustrate the validity and effectiveness of our model, which achieves state-of-the-art performance on all datasets. Further analyses confirm that the proposed model is able to generate keyphrases consistent with the references while keeping sufficient diversity. The code of CDKGen is available at https://github.com/SVAIGBA/CDKGen.
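The copy mechanism can be sketched as a standard pointer-generator mixture, scattering source attention weights onto source token ids; this is a generic formulation under assumed tensor shapes, not CDKGen's exact equations.

```python
import torch

def copy_distribution(vocab_logits, attn, src_ids, p_gen):
    """Mix the decoder's vocabulary distribution with a copy distribution
    obtained by scattering source attention weights onto source token ids,
    which lets the model emit out-of-vocabulary words present in the input."""
    vocab_dist = torch.softmax(vocab_logits, dim=-1)          # (B, V)
    copy_dist = torch.zeros_like(vocab_dist)
    copy_dist.scatter_add_(1, src_ids, attn)                  # (B, V)
    return p_gen * vocab_dist + (1 - p_gen) * copy_dist

B, V, S = 2, 1000, 7                                          # toy sizes
final = copy_distribution(torch.randn(B, V),                  # decoder logits
                          torch.softmax(torch.randn(B, S), -1),  # src attention
                          torch.randint(0, V, (B, S)),        # source token ids
                          p_gen=torch.rand(B, 1))             # generate-vs-copy
assert torch.allclose(final.sum(-1), torch.ones(B))           # valid distribution
```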
Deep learning models can achieve high accuracy when trained on large amounts of labeled data. However, real-world scenarios often involve several challenges: Training data may become available in installments, may originate from multiple different domains, and may not contain labels for training. Certain settings, for instance medical applications, often involve further restrictions that prohibit retention of previously seen data due to privacy regulations. In this work, to address such challenges, we study unsupervised segmentation in continual learning scenarios that involve domain shift. To that end, we introduce GarDA (Generative Appearance Replay for continual Domain Adaptation), a generative-replay based approach that can adapt a segmentation model sequentially to new domains with unlabeled data. In contrast to single-step unsupervised domain adaptation (UDA), continual adaptation to a sequence of domains enables leveraging and consolidation of information from multiple domains. Unlike previous approaches in incremental UDA, our method does not require access to previously seen data, making it applicable in many practical scenarios. We evaluate GarDA on two datasets with different organs and modalities, where it substantially outperforms existing techniques.
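The overall replay loop might be structured as below; the toy generator, toy segmenter, and the entropy-minimization loss are placeholders standing in for GarDA's actual appearance generator and unsupervised objectives, so this sketches only the control flow.

```python
import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    """Stand-in for the appearance generator learned on past domains."""
    latent_dim = 16
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(self.latent_dim, 3 * 8 * 8)
    def forward(self, z):
        return self.net(z).view(-1, 3, 8, 8)

class ToySegmenter(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 2, 1)          # 2-class toy segmenter
    def forward(self, x):
        return self.net(x)

def adapt_with_generative_replay(model, generator, new_domain_batches,
                                 opt, replay_ratio=0.5):
    """Mix generated samples from previously seen domains into each
    unlabeled batch from the new domain, so adaptation consolidates old
    knowledge without retaining any past data."""
    for images in new_domain_batches:
        n = max(1, int(images.size(0) * replay_ratio))
        with torch.no_grad():
            replayed = generator(torch.randn(n, generator.latent_dim))
        logits = model(torch.cat([images, replayed]))
        probs = logits.softmax(dim=1)
        # Placeholder unsupervised objective: per-pixel entropy minimization.
        loss = -(probs * probs.clamp_min(1e-8).log()).sum(1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

model, gen = ToySegmenter(), ToyGenerator()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
adapt_with_generative_replay(model, gen, [torch.randn(4, 3, 8, 8)], opt)
```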
The development of social media user stance detection and bot detection methods relies heavily on large-scale, high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, hindering graph-based account detection research. To address these issues, we propose MGTAB, a Multi-Relational Graph-Based Twitter Account Detection Benchmark and the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB is built on the largest raw data collection in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. As user features, we extracted the 20 user property features with the greatest information gain together with user tweet features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments show that graph-based approaches are generally more effective than feature-based approaches and perform better when multiple relations are introduced. By analyzing the experimental results, we identify effective approaches for account detection and outline potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
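As a starting point for the graph-based approaches the benchmark targets, a minimal multi-relational GNN over 7 relation types could look like the following; the random tensors mock the dataset, and the feature and class sizes are illustrative, not MGTAB's published schema.

```python
import torch
from torch_geometric.nn import RGCNConv

num_users, feat_dim, num_relations, num_classes = 1000, 20, 7, 2
x = torch.randn(num_users, feat_dim)                       # user features
edge_index = torch.randint(0, num_users, (2, 5000))        # (src, dst) pairs
edge_type = torch.randint(0, num_relations, (5000,))       # relation per edge

class RelGNN(torch.nn.Module):
    """Two relational graph convolutions, one set of weights per relation."""
    def __init__(self):
        super().__init__()
        self.conv1 = RGCNConv(feat_dim, 64, num_relations)
        self.conv2 = RGCNConv(64, num_classes, num_relations)
    def forward(self, x, edge_index, edge_type):
        h = self.conv1(x, edge_index, edge_type).relu()
        return self.conv2(h, edge_index, edge_type)        # per-user logits

logits = RelGNN()(x, edge_index, edge_type)                # (num_users, 2)
```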